ABSTRACT

A prototype system for the transliteration of diacritics-less
Arabic manuscripts at the sub-word, or part of Arabic word
(PAW), level is developed. The system reads the sub-words of
the input manuscript using a set of skeleton-based features.
A variation of the system is also developed which reads
archigraphemic Arabic manuscripts, which are dot-less, into an
archigrapheme transliteration. In order to reduce the
complexity of the original highly multi-class problem of
sub-word recognition, it is redefined as a set of binary
descriptor classifiers. The outputs of the trained binary
classifiers are combined to generate the sequence of sub-word
letters. SVMs are used to learn the binary classifiers. Two
specific Arabic databases have been developed to train and
test the system. One of them is a database of the Naskh style.
The initial results are promising. The system could be trained
on other scripts found in Arabic manuscripts.

Categories and Subject Descriptors 

I.7.5 [Document and Text Processing]: Document Capture -
Document analysis, Graphics recognition and interpretation;
I.2.6 [Artificial Intelligence]: Learning

Keywords 

Optical shape recognition, Arabic language, Databases 

1. INTRODUCTION 

The special feature of Arabic manuscripts is the cursive nature
of their scripts [1], which means that they are more oriented
toward sub-words (letter-blocks or connected components: "CCs")
than words. This is particularly true of pre-modern
manuscripts, in which there is no measurable difference between
the distances separating sub-words and those separating words.
It is worth noting that sub-words (or parts of Arabic words:
PAWs) in the Arabic language are sets of letters (letter-blocks)
which are disconnected from one another at the pixel level. To
add to this complexity, the shapes of letters change according
to their position within a sub-word (that is, each letter has
various allographs). The presence of cavities inside the shapes
further increases the complexity of these scripts, and special
features are required to describe them. On top of all this, the
high degree of intra-script variation makes the task of
achieving a single solution for all Arabic scripts extremely
difficult. The main challenge confronting recognition methods
is the increase in their complexity when moving from a
low-complexity database, such as city names, to the full set of
words in the language [2,3].

In this work, we use sub-words in order to skip the line- and
word-segmentation problem encountered in pre-modern Arabic
manuscripts. After recognizing all the sub-words of a
manuscript, its words should be reconstructed, which is beyond
the scope of this work. We provide a complete recognition chain
at the sub-word level. It works directly with the sub-words, and
does not try to break them into character segments. Therefore,
we call it an Optical Shape Recognition (OSR) system. The system
uses a novel concept that we call the binary descriptor paradigm
(we have also used the term binary problem paradigm for it [4]).
In this paradigm, a set of overlapping binary descriptors allows
us to classify all sub-word classes without segmenting the
sub-words into characters. To achieve this, we consider
membership functions that arise from creating a new
representation of sub-words in terms of their letters. For
example, a sub-word "bkt", which is usually represented by an
ordered vector ("b", "k", "t"), will be represented by a set of
binary descriptors $\{P_\omega\}_\omega$, where $\omega$ ranges
over all letters and $P_\omega$ is one of the binary descriptors
used to describe the sub-words. As can be seen from the
definition, the new representation is order-free, and therefore
each descriptor can be processed independently. It is worth
noting that, although there is no order in the set of binary
descriptors, they can carry order information within
themselves. For example, a binary descriptor could be whether or
not the second letter of the sub-word is the letter "m". For the
sub-word "bkt", this binary descriptor will give 0, because the
second letter is "k". Further discussion of the binary
descriptor paradigm is provided in section 3.1.
It is worth noting that this concept was previously proposed in
[4]. However, in that work, only the general idea was discussed
and, as a proof of concept, a small number of binary descriptors
(problems) which had more than 1000 positive samples were
considered and learned with a promising error range. No attempt
was made to recover the sequence of letters of a sub-word. In
this work, not only is a complete set of binary descriptors
considered, even when the number of positive samples is very
low, but the system also provides a set of candidate letter
sequences for each sub-word by combining the values of the
binary descriptors. In order to obtain the full text, sub-words
should be combined using language-level analysis and word
distributions. This step is beyond the scope of this work, and
will be addressed in future research.
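The order-free representation described above can be illustrated with a minimal Python sketch. The function names and the toy Latin alphabet are assumptions made for illustration only; they are not part of the system itself.

```python
def presence_descriptor(subword, letter):
    """Return 1 if `letter` occurs anywhere in `subword`, else 0 (order-free)."""
    return int(letter in subword)


def position_descriptor(subword, position, letter):
    """Return 1 if the letter at 1-based `position` of `subword` is `letter`."""
    return int(len(subword) >= position and subword[position - 1] == letter)


# Toy alphabet, hypothetical and for illustration only.
alphabet = "bkmt"
subword = "bkt"

# One presence descriptor per letter of the alphabet.
presence = {w: presence_descriptor(subword, w) for w in alphabet}

# Descriptors that carry order information within themselves.
second_is_m = position_descriptor(subword, 2, "m")
second_is_k = position_descriptor(subword, 2, "k")

print(presence)
print(second_is_m, second_is_k)
```

Each descriptor is evaluated on the whole sub-word, so no character segmentation is needed to compute or to learn it.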

A schematic flow diagram of the proposed OSR system is shown in
Figure 1. The input document images are preprocessed, and then
the sub-words are extracted by identifying CCs. The feature
vector of each sub-word is calculated from its skeleton and some
a priori information, such as the average stroke width (see
section 2.1). The binary labels of each sub-word are obtained
using trained machines, and are then combined to generate the
candidate letter sequences of each sub-word. Using a dictionary
of sub-words, the set of candidate sequences is pruned, and the
final set of candidate sequences for each sub-word is provided
as the output of the system. We use the term "string" for the
sequence of sub-word letters to avoid confusion with the labels
of the binary descriptors.
Also, it is worth noting that by "Arabic-scripted language" we
refer to all languages whose scripts are based on the Arabic
script, including not only Arabic but also the Persian, Urdu and
Ottoman-Turkish languages. In other words, we do not limit the
scope of letters to a specific language. However, the scope will
be automatically limited in the training stage for each language
and selected script and style. Script, style and language
identification steps are ignored in this work. By training our
system on different scripts, styles and languages, and adding
the associated identification steps, it can be used to read
manuscripts from those languages and scripts.

Databases are the building blocks of recognition systems [3-7].
For example, a database for the recognition of legal amounts and
Arabic sub-words on Arabic checks has been developed [6] which
contains 1547 legal amounts and 23325 sub-words. We used two
databases in this work: i) an Arabic language database created
from a real historical manuscript, and ii) a synthesized
archigraphemic-Arabic language database in Naskh style.
Archigraphemic Arabic ignores the notation of dots. Therefore,
letters with the same archigrapheme, such as ba' (ب) and ta'
(ت), appear exactly the same, and are represented by the
dot-less ba' (ٮ). Because differentiation between these letters
requires language-level analysis, which is not included in the
current system, the output of the system will also be in
archigraphemes. It is also worth noting that an
archigraphemic-Arabic system can be used to recognize normal
Arabic manuscripts by stripping out the dots before feeding the
manuscript to the system, and then using a dot analysis which
recovers Arabic letters from the Arabic archigraphemes using the
dot information. A snapshot of the user interface of our system
is shown in Figure 2. The user can click on a sub-word, and the
ground-truth sequence and the first-rank recognized sequence
will be shown. The databases are discussed further in section 2.
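The sub-word extraction step mentioned above (identifying connected components in the preprocessed page) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the page is a binary list of rows with 1 marking ink, 8-connectivity is used, and the name `extract_connected_components` is hypothetical. Assigning dots to their sub-word bodies is not handled here.

```python
from collections import deque


def extract_connected_components(image):
    """Return the connected components of a binary image.

    `image` is a list of rows holding 0 (background) or 1 (ink).
    Each component is returned as a list of (row, col) pixels.
    8-connectivity is an assumption of this sketch.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Toy page with four separate blobs of ink.
page = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
]
components = extract_connected_components(page)
print(len(components), [len(p) for p in components])
```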

We use support vector machines (SVMs) as the learning machines.
They are trained as follows. Given the ground-truth sequences of
all the sub-words, the labels of the binary descriptors are
generated and fed into the SVMs. An optimization of the SVM
parameters is also performed. We used two databases that are
available to us for training and testing the proposed system.
The first is the IBN SINA database [4] and the second is a new
database that we have developed using a font system for Arabic
scripts^1 [8]. The same procedure was applied to both
databases, the main difference being that archigrapheme encoding
is used in the second database in place of the grapheme encoding
used in the IBN SINA database. An archigrapheme is the bundle of
shared features between two or more graphemes, minus their
distinctive features (diacritics) [9]. Particularly for the
Arabic script, an archigrapheme is a diacritics-less ductus of
its associated graphemes [10]. The archigraphemes are shown in
Figure 5.

The organization of the paper is as follows. In section 2, 
more detail on the materials used in the development of the 
databases is provided. The procedure that we followed for 
training the SVMs and building the database dictionary is 
described in section 3. The performance of the whole system 
on the databases is presented in section 4. Finally, a discus- 
sion, our conclusions, and future prospects are provided in 
sections 5 and 6. 

2. TWO ARABIC SUB-WORD DATABASES

Two databases were used in this work. The first is the IBN SINA
database, built from manuscript images provided by the Institute
of Islamic Studies (IIS), McGill University, Montreal. The
author of the manuscript is Sayf al-Din Abu al-Hasan Ali ibn Abi
Ali ibn Muhammad al-Amidi (d. 1243 A.D.). The title of the
manuscript is Kitab Kashf al-tamwihat fi sharh al-Tanbihat
(Commentary on Ibn Sina's [i.e., Avicenna, d. 1037 A.D.]
al-Isharat wa-al-tanbihat). Of all of his philosophical works,
Ibn Sina's al-Isharat wa-al-tanbihat received the most attention
from later philosophers and theologians. The database consists
of 51 folios, and contains 20722 sub-words.

The second dataset is based on the Arabic Calligraphic Engine
(ACE), which is a font-layout engine. ACE was developed to
approach Arabic computer typography in complete analogy with
pre-typographic text manufacture, and is the proof of concept
for modern smart-font technology. Currently, only sub-words
containing up to 3 letters have been added to the database,
which from now on will be called the Naskh-3 database. In
contrast to the IBN SINA database, in the Naskh-3 database,
archigraphemes are the smallest unit to be recognized. As
discussed in the introduction, our aim with this choice was to
try an alternative recognition, in which archigraphemes are
recognized first, and then relabeled with their graphemes in a
second round. In this way the high complexity of the scripts can
be addressed in two steps. Note that we target only the first
step in this work, i.e., the recognition of archigraphemes.

^1 Tasmeem: http://www.decotype.com/

Figure 1: The flow diagram of the proposed OSR system. The
structure of data is shown at each step in shaded boxes.

It is worth noting that our databases are diacritics-less, i.e.,
diacritical marks do not exist in them. In the case of the IBN
SINA database, this is due to the nature of the manuscript used.
We did not include diacritics in the Naskh-3 database, in order
to keep it simple. Developing a system that processes
manuscripts with diacritics is beyond the scope of this work.
Also, touching sub-words do not occur in our databases because
of the high-quality writing hands of their associated document
images.

2.1 Skeleton-based features 

We generate the skeleton-based features of each sub-word. A
sample sub-word and its skeleton are shown in Figure 3. Starting
with the skeleton of a sub-word, its end points (EPs), branch
points (BPs) and dots or singular points (SPs) are identified.
Then, some features are assigned to each of these points
depending on their connectivity to the others, in order to
include the topology of the skeleton in the features. Some other
features are assigned as well, to capture the geometrical
relationship between the points. These features can be
represented as a set of transformations from the skeleton image
space to some associated feature spaces. Let us consider $u$ and
$u_{skel}$ to be the images of a sub-word (CC) and its skeleton
respectively, where $u_{skel} : \Omega_{skel} \to \{0, 1\}$ and
$\Omega_{skel} \subset \Omega \subset \mathbb{R}^2$. $\Omega$ is
the domain of the whole page that is hosting the sub-word under
study. Let us call $T$ the set of our transformations that map
$u_{skel}$ to the proper spaces:
$T = \{T_i \mid T_i : \Omega_{skel} \to (\mathbb{R}^{m_i})^{n_i},\ i = 1, \dots, n_T\}$,
where $n_T$ is the number of transforms and $m_i$ depends only
on the transformation $T_i$, while $n_i$ depends on the
complexity of $u_{skel}$ as well:

$f_i = T_i(u_{skel}) \in (\mathbb{R}^{m_i})^{n_{i,u_{skel}}}$    (1)

It is worth noting that the output of each transformation varies
according to the complexity of the sub-word. In other words, the
dimensions of the target feature spaces are not constant. The
list of the features can be found in Table 1. The details are as
follows:

1. $T_1$ extracts features from BPs:

• BHoleCon is 1 if the BP is connected to a hole,

• BEPCon is 1 if it is connected to an EP, and

• BBPCon is 1 if it is connected to another BP.

2. $T_2$ extracts features from EPs:

• EBPCon is 1 if the EP is connected to a BP,

• EEPCon is 1 if it is connected to another EP, and

• ERelVertCMEP is positive if it is above the vertical center of
mass of the sub-word.

3. $T_3$ extracts dot-related features of a BP:

• BDotUpFlag is 1 if there is a dot above the BP, and

• BDotDownFlag is 1 if there is a dot below it.

4. $T_4$ extracts dot-related features of an EP: EDotFlag is 1 if
there is a dot assigned to the EP.

5. $T_5$ extracts dot-related features of a dot: DRelVertCMDot is
positive if the dot is above the vertical center of mass of the
sub-word.

6. $T_6$ extracts dot-related features of an EP branch:

• ESShapeFlag is 1 if the branch is S-shaped,

• EClockwise is positive if it is clockwise,

• EAboveItsBP is 1 if its EP is above its BP, and

• EBelowItsBP is 1 if its EP is below its BP.

Also, 8 global features are assigned, which we consider as $T_g$
(see Table 2). The details are as follows:

1. AR is the aspect ratio.

2. HorizFreq is the number of peaks in the horizontal profile of
the sub-word.

3. VertCMRatio is the ratio of the vertical center of mass to
the sub-word height.

4. #SPs is the number of singular points in the sub-word.

5. Heightratio is the ratio of the sub-word height to the
average text height.

6. HoleFlag is 1 if there is a hole in the sub-word.

7. #EPs is the number of end points in the sub-word.

8. DottedFlag is 1 if there is a dot in the sub-word.

Figure 2: A snapshot of the system's user interface. By clicking
on the sub-words, their sequences appear on the image in
Finglish. Unicode fonts will be integrated into the interface
soon.

Figure 3: A sample sub-word from the IBN SINA database and its
skeleton image. The branch points (blue), end points (red), and
singular points (green) are also shown on the skeleton image.
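The point types on which the skeleton-based features rely (EPs, BPs and SPs) can be located on a one-pixel-wide skeleton with a crossing-number test. The sketch below is one common way of doing this and is an assumption, not necessarily the detection rule used in this work; the names `crossing_number` and `classify_skeleton_points` are hypothetical.

```python
# The 8 neighbours of a pixel, listed in circular (clockwise) order from north.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def ring_values(skeleton, r, c):
    """Values of the 8 neighbours of (r, c); pixels outside the image count as 0."""
    rows, cols = len(skeleton), len(skeleton[0])
    return [
        skeleton[r + dr][c + dc]
        if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
        for dr, dc in RING
    ]


def crossing_number(values):
    """Number of 0 -> 1 transitions when walking once around the ring."""
    return sum(1 for k in range(8) if values[k] == 0 and values[(k + 1) % 8] == 1)


def classify_skeleton_points(skeleton):
    """Return end points (EP), branch points (BP) and singular points (SP)."""
    points = {"EP": [], "BP": [], "SP": []}
    for r, row in enumerate(skeleton):
        for c, value in enumerate(row):
            if value != 1:
                continue
            values = ring_values(skeleton, r, c)
            crossings = crossing_number(values)
            if sum(values) == 0:
                points["SP"].append((r, c))   # isolated pixel, e.g. a dot
            elif crossings == 1:
                points["EP"].append((r, c))   # a single branch leaves the pixel
            elif crossings >= 3:
                points["BP"].append((r, c))   # three or more branches meet
    return points


# A T-shaped skeleton plus one isolated pixel.
skeleton = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]
points = classify_skeleton_points(skeleton)
print(points)
```

Counting transitions around the ring, rather than counting neighbours, avoids reporting spurious branch points on the pixels adjacent to a junction.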

In order to have a coherent set of features for all the shapes,
a limit on the number of points of each type, $l_{point}$, is
assumed. In this way, if, for example, there are more than
$l_{point}$ EPs for a sub-word, then all EPs after the first
$l_{point}$ are dropped. If the number of points is less than
$l_{point}$, the rest of the vector is filled with zeros. As
Arabic manuscripts are written from right to left, the first
$l_{point}$ points from the right side of a sub-word are
considered. In this work, we assume that $l_{point} = 6$. Since
the transformations $T_1, \dots, T_6$ contribute
3 + 3 + 2 + 1 + 1 + 4 = 14 features per point, this means that
14 × 6 = 84 skeleton-based features are assigned to each shape.
Adding the 8 global features, this brings the total number of
features assigned to each shape to 92:
$x_i = \{x_{i,\omega}\}_{\omega=1}^{92}$, where $x_i$ is one of
the feature vectors and $\omega$ is the index. A typical feature
vector is shown in Table 3. It starts with the global features,
which are followed by the six occurrences of each $T_i$,
$i = 1, \dots, 6$.
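The fixed-length layout described above (8 global features followed by six zero-padded or truncated occurrences of each transformation) can be sketched as follows. The per-transformation sizes are taken from the feature lists in this section; the function and variable names are hypothetical, and the input vectors are assumed to be already ordered from right to left.

```python
L_POINT = 6       # limit on the number of points kept per transformation
N_GLOBAL = 8      # number of global features
# Number of features each transformation assigns to one point.
POINT_FEATURE_SIZES = {"T1": 3, "T2": 3, "T3": 2, "T4": 1, "T5": 1, "T6": 4}


def pad_or_truncate(point_vectors, size, limit=L_POINT):
    """Keep the first `limit` per-point vectors, zero-pad the rest, and flatten."""
    kept = [list(v) for v in point_vectors[:limit]]
    for v in kept:
        if len(v) != size:
            raise ValueError("unexpected per-point feature size")
    kept += [[0.0] * size for _ in range(limit - len(kept))]
    return [value for v in kept for value in v]


def build_feature_vector(global_features, per_transform):
    """Concatenate the global features with the padded per-point features."""
    if len(global_features) != N_GLOBAL:
        raise ValueError("expected 8 global features")
    vector = list(global_features)
    for name, size in POINT_FEATURE_SIZES.items():
        vector.extend(pad_or_truncate(per_transform.get(name, []), size))
    return vector


# One BP (T1) and nine EPs (T2); the last three EPs are dropped.
example = build_feature_vector(
    [0.0] * 8,
    {"T1": [[1, 0, 1]], "T2": [[1, 0, 0.5]] * 9},
)
print(len(example))
```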



$T_g$
1  AR             5  Heightratio
2  HorizFreq      6  HoleFlag
3  VertCMRatio    7  #EPs
4  #SPs           8  DottedFlag

Table 2: The 8 global features of a sub-word.

Here, we provide a short discussion of the encoding system used.
It is worth noting that the encoding system does not have a
direct impact on the performance of the systems. In the IBN SINA
database, Finglish has been used (see Figure 4). We also used a
more standard encoding system: Unicode^2. For example, the
Latin letter "a", which stands for the Arabic letter alif in
Finglish, is replaced by the Unicode character for alif, which
has the hexadecimal index 0627 in the Unicode table. As
previously mentioned, we follow the archigrapheme encoding in
the Naskh-3 database (shown in Figure 5). In this encoding, dots
are ignored. It is worth noting that the dash and brackets used
in Figure 5 will not appear in an actual transliteration.
Similarly, these Latin representations are replaced by Unicode
representations. For example, the Unicode character 066E is used
for the dot-less ba' (ٮ) (B in the archigrapheme encoding),
which represents all ba'-like letters in the archigrapheme
encoding (see Figure 6 for the correspondence between the
dot-less ba' and ba'-like letters).
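The two code points mentioned above can be checked with Python's standard `unicodedata` module, and the replacement of ba'-like letters by the dot-less ba' can be sketched as a simple character mapping. The mapping below is a deliberately partial assumption for illustration (only ba', ta' and tha' are listed); the full correspondence is the one given in Figures 5 and 6.

```python
import unicodedata

ALIF = "\u0627"         # hexadecimal index 0627
DOTLESS_BA = "\u066E"   # hexadecimal index 066E

# Assumed, partial mapping from graphemes to their archigrapheme.
TO_ARCHIGRAPHEME = {
    "\u0628": DOTLESS_BA,  # ba'
    "\u062A": DOTLESS_BA,  # ta'
    "\u062B": DOTLESS_BA,  # tha'
}


def to_archigraphemes(text):
    """Replace every mapped grapheme by its archigrapheme; keep other characters."""
    return "".join(TO_ARCHIGRAPHEME.get(ch, ch) for ch in text)


print(unicodedata.name(ALIF))
print(unicodedata.name(DOTLESS_BA))
print(to_archigraphemes("\u0628\u0627\u062A") == "\u066E\u0627\u066E")
```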

3. TRAINING AND BUILDING THE PROPOSED OSR SYSTEM

According to Figure 1, the sub-word labels are first converted
to binary descriptors, and then SVMs are used for learning their
behavior. The details are provided in the following subsections.

^2 http://unicode.org/charts/PDF/U0600.pdf

Table 1: The various feature vectors associated with the BPs,
EPs, and SPs of a sub-word.

Table 3: A typical sub-word feature vector composed of the
feature vectors in Tables 1 and 2. It has 92 elements.

Figure 5: The archigrapheme encoding used in this work. A dash
before a Latin letter in this table means that that letter can
only appear at the end of a sub-word. Brackets around a letter
indicate that this letter also has the same form in the middle
of a sub-word.



3.1 Conversion of string labels to binary descriptors

The binary-descriptor concept refers to a new way of addressing
the highly multi-class nature of sub-word labeling by defining a
set of letter binary descriptors to redefine the labeling
problem. Figure 7 illustrates this concept. If we assume the
alphabet has just three letters, alif, ba' and ta', the possible
combinations of these letters (ignoring the order) can be
visualized as the parts of the leaves in the figure.
Non-overlapping parts indicate single-letter sub-words, while
overlapping parts correspond to sub-words composed of the
associated letters. In the binary-descriptor concept, each leaf
is considered as a single binary descriptor. Therefore, the
original highly multi-class problem can be replaced by an
ensemble of binary descriptor classifiers, which are easier to
learn thanks to the existence of various state-of-the-art
classification methods, such as SVMs, for binary problems. In
[4], only the binary descriptors that check for the presence of
letters in sub-words were considered. In this way, the order is
completely ignored. In this work, in order to increase the
accuracy in recovering the correct order of letters, we add
additional binary descriptors to generate implicit clues to the
order of letters in the sub-word. For example, in addition to
the binary descriptors of the presence of letters in the
sub-words, a figure similar to Figure 7 can be considered, but
now for the first letter of sub-words. The corresponding binary
descriptors will learn the presence of letters as the first
letter of the sub-words. The same process can be performed for
the second, the third, etc. letters of the sub-words. We use six
different types of binary descriptors:



1. $P_{L,\omega}$: For each letter $\omega$, $P_{L,\omega}$
determines if that letter is present in the sub-word or not.

2. $P_{T,\omega}$: For each letter $\omega$, $P_{T,\omega}$
determines if more than one instance of that letter is present
in the sub-word or not.

3. $P_{1,\omega}$: For each letter $\omega$, $P_{1,\omega}$
determines if that letter is the first letter of the sub-word or
not.

4. $P_{2,\omega}$: For each letter $\omega$, $P_{2,\omega}$
determines if that letter is the second letter of the sub-word
or not.

5. $P_{3,\omega}$: For each letter $\omega$, $P_{3,\omega}$
determines if that letter is the third letter of the sub-word or
not.

6. $P_{S,s}$: For digit $s$, $P_{S,s}$ determines if that digit
is 1 or not in the binary representation of the length of the
sub-word. For example, for a sub-word with 3 letters, the binary
representation is 11. Therefore, $P_{S,1} = 1$, $P_{S,2} = 1$,
$P_{S,3} = 0$, and so on. In this work, only $s = 1, \dots, 4$
are considered.

For example, for a sub-word "lkm", $P_{L,l} = 1$, $P_{L,k} = 1$,
$P_{L,m} = 1$, $P_{1,l} = 1$, $P_{2,k} = 1$, $P_{3,m} = 1$,
$P_{S,1} = 1$, $P_{S,2} = 1$, and all the other descriptors are
negative. In this example, Latin letters are used for the sake
of simplicity.

Figure 6: An example of archigrapheme encoding: the
archigrapheme dot-less ba' replaces all ba'-like letters shown
on the right side.

Figure 7: Concept of redefining the sub-words in terms of the
letter binary descriptors.

It is worth noting that for some binary descriptors the number
of positive samples is very small. However, in order to have a
complete system, we trained SVMs on all the binary descriptors.
We are working to increase the size of the databases, especially
the Naskh-3 database, to include all possible sub-words of any
possible length.

3.2 Training of SVMs

SVM classifiers are used to learn the behavior of the binary
descriptors. They are a particular type of linear classifier
based on the margin-maximization principle [11]. They are
powerful classifiers, and have been used successfully in many
pattern recognition problems [12]. In [4], we used SVMs to learn
a few binary descriptors with a high number of positive samples.
Here, we apply the same approach to all binary descriptors,
trying to strike a balance between positive and negative
populations.

A radial basis function (RBF) kernel is used:
$k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, where $x_i$ and
$x_j$ are two typical feature vectors and $\gamma$ is the kernel
parameter. Because all the binary descriptors we have are
unbalanced, we use different hyper-parameters: $C$ for
controlling the training error impact; $C_+$ for positive
samples; and $C_- = C_+ / C_j$ for negative samples, where
$C_j = n_- / n_+$, with $n_+$ and $n_-$ representing the number
of positive and negative samples respectively [13]. The SVM
parameters are optimized on the training set to select the best
model.

Also, in order to have a probability distribution of the
outputs, the following distribution is fitted on the outputs of
the trained SVMs:

$y' = 1 / (1 + \exp(a y + b))$

where $y$ is the SVM output, $a$ and $b$ are the fitted
parameters, and $y'$ is distributed between 0 and 1.
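The quantities used in this subsection can be written out in a few lines of plain Python. This is a minimal sketch, not a training procedure: the cost ratio follows the definitions $C_j = n_-/n_+$ and $C_- = C_+/C_j$ given above, while the sigmoid form of the fitted distribution (a Platt-style mapping with parameters `a` and `b`) is an assumption. All function names are hypothetical.

```python
import math


def class_costs(labels, c_plus=1.0):
    """Return (C_j, C_plus, C_minus) for a list of 0/1 descriptor labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    c_j = n_neg / n_pos               # C_j = n_- / n_+
    return c_j, c_plus, c_plus / c_j  # C_- = C_+ / C_j


def rbf_kernel(x_i, x_j, gamma):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def sigmoid_probability(decision_value, a, b):
    """Map an SVM output to (0, 1) with an assumed sigmoid 1 / (1 + exp(a*y + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# A descriptor with 1 positive and 4 negative training samples.
c_j, c_plus, c_minus = class_costs([1, 0, 0, 0, 0])
print(c_j, c_plus, c_minus)
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
print(sigmoid_probability(0.0, a=-1.0, b=0.0))
```

With a negative `a`, larger SVM outputs map to probabilities closer to 1.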

3.3 Reconstructing the sequence of letters from 
the binary labels 

With the trained SVMs, the system can generate the binary labels
of each sub-word. The next step in the recognition process is to
reconstruct the candidate sequences out of these labels. First,
a set of sequences is built based on the outputs of the
$P_{L,\omega}$ and $P_{T,\omega}$ descriptors by permuting the
positive letters. Then, only those sequences that are compatible
with the first letters indicated by the $P_{1,\omega}$
descriptors are kept. The $P_{S,s}$ descriptors are used to
select the most probable letters from $P_{L,\omega}$ and
$P_{T,\omega}$, in order to build the sequences.
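The reconstruction step described above can be sketched as follows. This is a simplified illustration under several assumptions: descriptor outputs are taken as hard 0/1 decisions rather than probabilities, a repeated letter is assumed to occur exactly twice, and the length digit with index 1 is assumed to be the least significant bit. The function names are hypothetical.

```python
from itertools import permutations


def decode_length(length_bits):
    """Decode the sub-word length; length_bits[0] is assumed least significant."""
    return sum(bit << i for i, bit in enumerate(length_bits))


def candidate_sequences(present, repeated, first_letters, length_bits):
    """Build candidate strings from positive presence/repetition descriptors.

    present       -- letters whose presence descriptor is positive
    repeated      -- letters whose repetition descriptor is positive
    first_letters -- letters whose first-letter descriptor is positive
    length_bits   -- outputs of the length descriptors
    """
    length = decode_length(length_bits)
    # Each present letter once, plus one extra copy of each repeated letter.
    pool = sorted(present) + sorted(set(repeated) & set(present))
    size = min(length, len(pool))
    candidates = {"".join(p) for p in permutations(pool, size)}
    if first_letters:
        # Keep only sequences compatible with the first-letter descriptors.
        candidates = {c for c in candidates if c and c[0] in first_letters}
    return sorted(candidates)


# Labels of the sub-word "lkm": three letters, first letter "l".
print(candidate_sequences({"l", "k", "m"}, set(), {"l"}, [1, 1, 0, 0]))
# A sub-word with a repeated letter.
print(candidate_sequences({"a", "b"}, {"b"}, {"b"}, [1, 1, 0, 0]))
```

In this simplified form the second- and third-letter descriptors are not used; they could be applied as further filters on the candidate set in the same way as the first-letter filter.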

The next step is to prune the candidate sequences based on a
dictionary. The dictionary for each database was built by
extracting all the strings associated with the database
sub-words. Therefore, the size of the dictionary is equal to the
number of unique sub-words (Basis CCs; the BCCs) in the
database. It is worth noting that, for the Naskh-3 database, the
dictionary is based on archigraphemes. In the future, we will
use one of the Arabic corpora (for example, the freely available
corpus [14]) to build the dictionary.
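Dictionary construction and pruning, as described above, amount to a set membership test. A minimal sketch, with hypothetical function names and toy Latin strings:

```python
def build_dictionary(ground_truth_strings):
    """Collect the unique strings of the database sub-words."""
    return set(ground_truth_strings)


def prune(candidates, dictionary):
    """Keep only the candidate sequences that appear in the dictionary."""
    return [c for c in candidates if c in dictionary]


# Four sub-words, three of which are unique.
dictionary = build_dictionary(["lkm", "bkt", "lkm", "a"])
pruned = prune(["lkm", "lmk"], dictionary)
print(len(dictionary), pruned)
```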

4. EXPERIMENTAL RESULTS 

Because of the unbalanced number of positive and negative
samples for each descriptor, the classic error rate (ER) is not
a suitable measure. Instead, we used the balanced error rate
(BER), which is the average of the misclassification rates on
examples drawn from positive and negative classes. The ER and
the BER are defined as follows:

$\mathrm{ER} = \frac{FN + FP}{TP + FN + FP + TN}$

$\mathrm{BER} = \frac{1}{2} \left( \frac{FN}{TP + FN} + \frac{FP}{FP + TN} \right)$

where FN, TP, FP, and TN represent false negative, true
positive, false positive, and true negative respectively. In
each run, the samples are divided into training and testing
subsets. Eighty percent of the samples are considered to be in
the training set. Because of the limited number of samples,
cross validation has been used and the SVMs are trained to
reduce the BER of the test set. The model with the minimum BER
is kept as the output of the training process.
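The difference between the two error measures defined above is easy to see on a small unbalanced example. The following sketch computes both from 0/1 labels; the function name is hypothetical.

```python
def error_rates(y_true, y_pred):
    """Return (ER, BER) for binary ground-truth and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")
    er = (fn + fp) / (tp + fn + fp + tn)
    ber = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
    return er, ber


# 2 positive and 8 negative samples; one positive sample is missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
er, ber = error_rates(y_true, y_pred)
print(er, ber)
```

Here a single missed positive sample yields a classic error rate of 0.1 but a balanced error rate of 0.25, because half of the positive class is misclassified.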

Some of the statistics of the two databases are provided in
Table 6. The performance of some of the individual binary
descriptors for the IBN SINA database is provided in Table 4.
The corresponding Latin letters from the Finglish encoding table
are also provided in the FNC (Finglish code) column. As can be
seen from the table, the complexity of the problem varies for
different letters. Also, the number of samples influences the
performance of the SVMs. It is worth noting that the number of
positive samples is considerably smaller for the first-letter
descriptors $P_{1,\omega}$ compared to the $P_{L,\omega}$
descriptors.

Table 4: The performance statistics of some of the SVMs trained
on the IBN SINA dataset.

Descriptor Name   ARC   BER     ER      C_j
P_{L,0627}        A     0.018   0.034   14.11
P_{L,066E}        B     0.063   0.10     4.14
P_{1,066E}        B     0.055   0.10    11.49
P_{2,0635}        C     0.019   0.034   14.27

Table 5: The performance statistics of the SVMs trained for some
of the binary descriptors for the Naskh-3 dataset.

Having the trained SVMs, the OSR system is applied to the
sub-words of the database according to Figure 1. The performance
of the system is provided in Table 6. The error of letters set
(ELS) measures the average error of the recognized sub-word with
respect to the ground truth, ignoring the position of letters in
the sequence:

$\mathrm{ELS} = \mathrm{average}_i \left[ 0.5 \left\{ \mathrm{ELS}(s_i, S_i) + \mathrm{ELS}(S_i, s_i) \right\} \right]$

where $s_i$ and $S_i$ are the recognized sub-word and its
associated ground truth for the $i$-th sub-word in the
manuscript. $\mathrm{ELS}(s_1, s_2)$ gives the error of the
letters-set of $s_1$ with respect to $s_2$. In contrast, the
recognition rate provides the percentage of correctly labeled
sub-words in the test set. The recognition rate of the first
rank is close to the all-ranks recognition rate thanks to the
presence of first-letter descriptors in the system. The
performance of the OSR system on the Naskh-3 database is also
provided in Tables 5 and 6. The archigrapheme codes are shown in
the ARC (archigrapheme code) column. We are working to achieve
higher performance by adding sub-words containing more than 3
letters to the database.
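The symmetrized average in the ELS formula can be sketched in Python once a one-sided error is fixed. The text does not spell out the one-sided error, so the definition used below (the fraction of distinct letters of the first string that are absent from the second) is purely an assumption for illustration, as are the function names.

```python
def letters_set_error(s1, s2):
    """Assumed one-sided error: share of distinct letters of s1 missing from s2."""
    letters_1, letters_2 = set(s1), set(s2)
    if not letters_1:
        return 0.0
    return len(letters_1 - letters_2) / len(letters_1)


def average_els(recognized, ground_truth):
    """Average of the symmetrized one-sided errors over all sub-words."""
    pairs = list(zip(recognized, ground_truth))
    if not pairs:
        raise ValueError("no sub-words given")
    total = sum(
        0.5 * (letters_set_error(s, g) + letters_set_error(g, s))
        for s, g in pairs
    )
    return total / len(pairs)


# "lkm" vs "lmk" differ only in order; "bt" misses the letter "k" of "bkt".
els = average_els(["lkm", "bt"], ["lmk", "bkt"])
print(els)
```

Because only sets of letters are compared, a recognized string with the right letters in the wrong order contributes no error to this measure, which is why it is reported alongside the recognition rate.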

Dataset Name                        IBN SINA   Naskh-3
Number of sub-words (CCs)           27709      2920
Size of sub-words dictionary        1629       2887
Error in letters set (ELS)          2.54       0.18
Recognition rate: first rank (*)    45.74      51.83
Recognition rate: first rank        88.28      95.59
Recognition rate: all ranks         89.66      96.26

Table 6: Statistics and performance of the proposed sub-word OSR
system for the IBN SINA dataset and the Naskh-3 dataset.
(*) Without considering first-, second- and third-letter
descriptors.

5. DISCUSSIONS



We can conclude from Tables 4, 5 and 6 that the binary labels
have been learned with a high level of performance (especially
in the case of the Naskh-3 database, with as low as 0.034
percent error in letter recognition). Although the performance
on the IBN SINA database is good, its lower performance may be
associated with degradation of the input images, and also with
the limited number of samples. Improvement of the skeleton-based
features would provide a better description of the sub-words,
potentially reducing the possibility of error at the
letter-recognition level.

Because the first-, second- and third-letter binary descriptors
are used, there is less difference between the first-rank and
all-ranks scores. It is worth noting that in the Arabic language
proper, a word's initial letter or letters often serve as
grammatical markers, while the subsequent letters are usually
markers of the word's particular root meaning.

Finally, it should be noted that, although we use the
first-letter binary descriptors in our set of descriptors, the
complete set of features of each sub-word is used to learn and
identify them. Therefore, the system is free of character
segmentation, and completely different from OCR methods.

6. CONCLUSIONS AND FUTURE PROSPECTS 

A prototype Optical Shape Recognition system has been developed
that can provide the labels at the sub-word level. The system is
able to recognize Arabic sub-words of the scripts on which it
has been trained. In order to avoid line/word segmentation, and
also to avoid highly multi-class classification, equivalent
binary descriptors are used. SVMs are trained to learn and
classify these descriptors, and the outputs of the trained SVMs
are combined to recover the original letter sequence of each
sub-word. Also, the skeleton-based features used to describe the
sub-words are robust with respect to possible variations in the
size and direction of the strokes. The system has been
separately trained and tested on two databases of different
scripts. The second database is a synthesized database based on
output from the ACE font-layout engine for the Naskh style.

Generalization of the system to generate the manuscript text is
under consideration. We are also working on completing the
Naskh-style database. Investigation of more descriptive
skeleton-based features is yet another goal. We are also
considering combining our system with others, such as HMM and
two-dimensional measures, in order to benefit from different
paradigms and improve the system. Evaluation of the method on
other databases, such as IFN/ENIT, is in progress.